Performance Tuning of Apache Spark Framework in Big Data Processing with Respect to Block Size and Replication Factor
Authors
Abstract
Apache Spark has recently become the most popular big data analytics framework, and it ships with default configurations. HDFS stands for Hadoop Distributed File System: large files are physically stored across multiple nodes in a distributed fashion. The block size determines how distributed the data is, while the replication factor determines how reliable it is; if there is just one copy of a given file and its node fails, the file becomes unreadable. Both parameters are configurable per file. This paper describes the results and analysis of an experimental study that tunes these settings to minimize application execution time, compared against the standard values. Drawing on a vast number of prior studies, we employed a trial-and-error strategy to fine-tune these values. We chose two workloads for a comparative analysis of the framework, Wordcount and Terasort, and used elapsed time to evaluate them.
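To make the tuning concrete, below is a minimal sketch of how these two parameters can be set per job from a Spark application, written in Scala against the standard Spark and Hadoop APIs. It is not the authors' experimental harness: the 256 MB block size, the replication factor of 2, and the HDFS paths are all illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Minimal Wordcount sketch illustrating per-job HDFS tuning.
// The dfs.blocksize and dfs.replication values below are
// illustrative assumptions, not the settings reported in the paper.
object WordcountTuned {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordcountTuned")
      .getOrCreate()

    // Spark's HDFS reads and writes respect the job's Hadoop
    // configuration: here, 256 MB blocks and two replicas per block.
    val hadoopConf = spark.sparkContext.hadoopConfiguration
    hadoopConf.set("dfs.blocksize", (256L * 1024 * 1024).toString)
    hadoopConf.set("dfs.replication", "2")

    // Standard Wordcount over a text file stored on HDFS
    // (the input and output paths are placeholders).
    val counts = spark.sparkContext
      .textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs:///data/wordcount-output")
    spark.stop()
  }
}
```

Note that dfs.blocksize only affects files written after the setting is applied; inputs already on HDFS keep the block size they were written with, so each block-size configuration in a study like this would require re-uploading the input (for example with hdfs dfs -D dfs.blocksize=... -put). The replication factor of existing files, by contrast, can be changed in place with hdfs dfs -setrep.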
Journal
Journal Title: SAMRIDDHI: A Journal of Physical Sciences, Engineering and Technology
Year: 2022
ISSN: 2229-7111, 2454-5767
DOI: https://doi.org/10.18090/samriddhi.v14i02.4